As CCTV networks expand globally, law enforcement and investigators face the daunting task of manually reviewing massive video archives to find missing persons. This study introduces CrowdScan, a Python-centric analytical framework designed to automate this search through a four-tier logic: (1) adaptive frame decimation to remove visual redundancy, (2) motion-based filtering using absolute-difference calculations, (3) MTCNN detection featuring strict size and confidence thresholds, and (4) identity verification via 512-dimensional FaceNet embeddings. To minimize false alarms, we integrated a Laplacian blur filter alongside a secondary re-embedding verification pass. Our implementation remains hardware-agnostic, dynamically toggling between CUDA-enabled FP16 and standard CPU FP32 inference without manual configuration. We also provide an optional YOLOv8s pre-filter to narrow face detection specifically to human silhouettes. Our deployment on Hugging Face Spaces demonstrated the pipeline's efficiency, processing 47.9 MB of video in 95 seconds on a standard CPU instance. The system identified 55 ranked matches with a peak confidence of 91.9%. By generating standardized JSON and CSV forensic reports, the framework enables investigative teams to execute high-speed, scalable searches without the need for specialized high-performance hardware.
Introduction
CrowdScan is a cost-efficient AI system designed to speed up missing-person identification in CCTV footage by reducing the manual video review workload.
The core idea is to avoid processing every frame (which is computationally expensive) by using motion filtering and adaptive frame sampling, allowing faster forensic analysis even on standard CPU hardware. This is important because large surveillance archives can contain millions of frames.
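The sampling-plus-motion-filtering idea can be illustrated with a minimal sketch. It is not the CrowdScan implementation: the fixed `stride` (standing in for the adaptive decimation), the `motion_threshold` value, the function names, and the use of flattened pixel lists instead of image arrays are all assumptions made to keep the example dependency-free.

```python
from typing import Iterable, Iterator, List, Tuple

# Illustrative stand-in for a grayscale image: a flat list of 0-255 pixel values.
Frame = List[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_frames(
    frames: Iterable[Frame],
    stride: int = 5,                # hypothetical fixed sampling interval
    motion_threshold: float = 2.0,  # hypothetical threshold, in gray levels
) -> Iterator[Tuple[int, Frame]]:
    """Yield (index, frame) for sampled frames that differ from the last kept frame."""
    last_kept = None
    for index, frame in enumerate(frames):
        if index % stride != 0:
            continue  # frame decimation: skip frames between sampling points
        if last_kept is None or mean_abs_diff(frame, last_kept) >= motion_threshold:
            last_kept = frame
            yield index, frame


# Toy video: 10 static frames followed by 10 frames in which half the pixels change.
static = [10] * 16
moved = [10] * 8 + [200] * 8
video = [static] * 10 + [moved] * 10
kept = [index for index, _ in select_frames(video)]
print(kept)  # [0, 10]
```

In this toy run only 2 of the 20 frames would reach the face detector. Comparing each sampled frame against the last kept frame, rather than the immediately preceding one, is a design choice of this sketch; it lets slow drift accumulate until it crosses the threshold.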
CrowdScan’s pipeline works in stages: it first samples video frames and discards static ones using motion detection, then applies MTCNN for face detection and FaceNet to generate face embeddings. Each embedding is compared with the reference embedding using cosine similarity, followed by a double-verification step to reduce false matches. Results are classified into three confidence levels (HIGH, MEDIUM, LOW). An optional YOLO-based pre-filter restricts face detection to regions in which a person has been detected.
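The matching and grading stages can be sketched as follows. This is a hedged illustration, not the system's code: the three threshold values, the function names, and the rule of grading on the weaker of the two similarity scores are assumptions, since the text does not state how the double-verification step combines its two passes. Real FaceNet embeddings are 512-dimensional; short vectors are used here for readability.

```python
import math
from typing import Optional, Sequence

# Hypothetical thresholds for illustration; the actual values are not given in the text.
HIGH_THRESHOLD = 0.80
MEDIUM_THRESHOLD = 0.65
LOW_THRESHOLD = 0.50


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def confidence_level(score: float) -> Optional[str]:
    """Map a similarity score to HIGH, MEDIUM, or LOW; None means no match."""
    if score >= HIGH_THRESHOLD:
        return "HIGH"
    if score >= MEDIUM_THRESHOLD:
        return "MEDIUM"
    if score >= LOW_THRESHOLD:
        return "LOW"
    return None


def verify_match(
    reference: Sequence[float],
    first_pass: Sequence[float],
    second_pass: Sequence[float],
) -> Optional[str]:
    """Grade a candidate on the weaker of two similarity scores (assumed rule).

    `first_pass` and `second_pass` are two embeddings of the same detected face,
    for example from the initial crop and from a re-embedded crop.
    """
    first = cosine_similarity(reference, first_pass)
    second = cosine_similarity(reference, second_pass)
    return confidence_level(min(first, second))


reference = [1.0, 0.0, 0.0]
print(verify_match(reference, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # HIGH
print(verify_match(reference, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # None
```

Grading on the minimum means a candidate is rejected whenever either pass falls below the lowest threshold, which is one simple way to make the second pass act as a false-match guard.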
The system was tested in a CPU-only environment on a real 47.9 MB video, which it processed in 95 seconds, producing 55 ranked matches with an average confidence of 87% and a peak confidence of 91.9%. Frame sampling and motion filtering reduced the computational load relative to processing every frame.
Compared to existing missing-person systems, CrowdScan achieves competitive accuracy while being CPU-friendly, low-cost, and more accessible, though it has limitations such as no cross-video tracking, reliance on a single reference image, and lack of large-scale benchmark validation.
Conclusion
CrowdScan provides a practical, hardware-agnostic solution for forensic CCTV analysis, enabling investigators to search massive video archives using only a standard laptop and a reference photograph. By integrating a four-stage pipeline (frame sampling, motion filtering, MTCNN face detection, and FaceNet-based cosine similarity), the system maintains high-accuracy detection without requiring specialized GPU infrastructure. Our live implementation on Hugging Face Spaces successfully processed a 47.9 MB video in 95 seconds, yielding 55 high-priority detections with an average confidence of 87.0%. Beyond immediate detection, the system generates standardized JSON and CSV reports that align with forensic chain-of-custody requirements, ensuring that outputs are both actionable and reliable for law enforcement. The platform is publicly accessible at https://huggingface.co/spaces/mr7072/CrowdScan2.0.
The system's primary contribution is accessibility: an investigator with a laptop, a web browser, and a reference photograph can search video archives without GPU infrastructure, specialized software, or data-engineering expertise. On GPU-equipped hardware, the same system delivers faster throughput automatically through FP16 inference, making it scalable to larger archives as compute resources grow. The strong live results from CPU-only hardware demonstrate practical investigative value without the infrastructure requirements of server-GPU-dependent alternatives such as TRACE [3] and Tripathi et al. [5].
Future work will address five directions: (1) cross-video trajectory linking to consolidate detection records for the same individual across multiple video files into a unified appearance timeline; (2) multi-reference-image support using average embedding or ensemble voting to improve recall against appearance changes; (3) systematic evaluation on standardized benchmarks such as Market-1501 and DukeMTMC to produce formal precision and recall metrics; (4) empirical GPU vs. CPU throughput benchmarking to validate the FP16 acceleration path under controlled conditions; and (5) lightweight temporal deduplication to group detections from the same sighting event into consolidated records, reducing redundancy in the forensic output.
References
[1] J. Xu, Y. Li, and W. Zhang, "Intelligent Video Surveillance Using Object Detection and Face Recognition," in Proc. IEEE ICIP, 2021, pp. 1–6.
[2] C.-H. Tseng, Y.-C. Lin, and C.-S. Chen, "Multi-Camera Person Retrieval Using Visual Attributes," in Proc. IEEE ICCVW, 2021, pp. 1–8.
[3] A. Nadeem, A. Jalal, and K. Kim, "TRACE: Missing Person Detection in Large Crowds," IEEE Access, vol. 10, pp. 32741–32755, 2022.
[4] K. Solaiman et al., "Find-Them: Multimodal Missing Person Search System," in Proc. Int. Conf. ECCE, 2022, pp. 1–6.
[5] H. Tripathi, R. Sharma, and P. Gupta, "YOLO-Based Face Crop and Recognition Pipeline for Surveillance," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 4, pp. 1821–1833, 2023.
[6] B. Jiang, H. Li, and X. Wang, "YOLO-FFRD for Small-Scale Pedestrian Detection," IEEE Trans. Intell. Transp. Syst., vol. 25, no. 2, pp. 1102–1115, 2024.
[7] Z. Ren, Q. Liu, and Y. Chen, "LittleFaceNet: Recognition of Small Faces in Surveillance Footage," IEEE Trans. Inf. Forensics Security, vol. 20, pp. 441–453, 2025.
[8] W. Chen, X. Zhu, and H. Wang, "End-to-End Person Search Using Joint Detection and Embedding," in Proc. IEEE/CVF CVPR, 2024, pp. 5671–5680.
[9] L. Wang, Z. Zhao, and F. Liu, "Efficient CCTV Video Search via Key-Frame Extraction," Pattern Recognit. Lett., vol. 168, pp. 112–120, 2023.
[10] T. Nguyen, P. Tran, and M. Le, "Forensic Video Search Using YOLOv8 and ArcFace," IEEE Trans. Biometrics Behav. Identity Sci., vol. 7, no. 1, pp. 88–100, 2025.
[11] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in Proc. IEEE CVPR, 2015, pp. 815–823.
[12] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint Face Detection and Alignment Using Multitask Cascaded CNNs," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, 2016.
[13] G. Jocher et al., "Ultralytics YOLOv8," 2023. [Online]. Available: https://github.com/ultralytics/ultralytics.
[14] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A Dataset for Recognising Faces Across Pose and Age," in Proc. IEEE FG, 2018, pp. 67–74.
[15] J. Park, S. Lee, and H. Kim, "Adaptive Frame Skipping for Real-Time Video Surveillance," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4310–4323, 2022.
[16] H. Li, G. Lin, X. Shen, and S. Lucey, "Face Detection Evaluation in Low-Resolution CCTV Video," in Proc. IEEE WACV, 2019, pp. 1–9.
[17] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," in Proc. IEEE/CVF CVPR, 2019, pp. 4690–4699.
[18] Y. Liu, R. Chen, and W. Zhang, "Content-Aware Temporal Sampling for Efficient CCTV Face Search," in Proc. ACM Multimedia, 2023, pp. 2145–2153.
[19] X. Huang, Z. Wang, and L. Li, "Optical Flow Guided Frame Selection for Surveillance Video Analysis," Pattern Recognit., vol. 135, p. 109170, 2023.
[20] S. Kim, J. Oh, and M. Park, "Face Quality-Aware Filtering for CCTV Recognition Pipelines," IEEE Trans. Inf. Forensics Security, vol. 18, pp. 3025–3037, 2023.
[21] H. Zhao, Q. Liu, and Y. Wang, "Self-Supervised Blur Detection for Real-Time Video Face Recognition," in Proc. IEEE ICASSP, 2023, pp. 1–5.
[22] R. Chen, X. Li, and J. Zhang, "Context-Padded Face Crops for More Stable Deep Embeddings," IEEE Signal Process. Lett., vol. 30, pp. 872–876, 2023.
[23] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. H. Hoi, "Deep Learning for Person Re-Identification: A Survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 2872–2893, 2022.
[24] H. Luo et al., "Bag of Tricks and a Strong Baseline for Deep Person Re-ID," in Proc. IEEE/CVF CVPRW, 2019.
[25] P. Wei, R. Zheng, Y. Zhao, and J. Tian, "Transfer Learning for Person Re-ID Across CCTV Domains," IEEE Trans. Multimed., vol. 24, pp. 3490–3502, 2022.
[26] X. Wu, R. He, Z. Sun, and T. Tan, "A Light CNN for Deep Face Representation," IEEE Trans. Inf. Forensics Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[27] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training," in Proc. ICML, 2015, pp. 448–456.
[28] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Proc. IEEE CVPR, 2005, pp. 886–893.
[29] Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face Recognition with Very Deep Neural Networks," arXiv:1502.00873, 2015.
[30] A. Paszke et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in NeurIPS, vol. 32, 2019.